Red Wine Quality Exploration by Jeff Sinclair

This report explores a dataset containing information regarding the quality of red wine for approximately 1,599 wines. The purpose of this exploration is to try and get a better understanding of which chemical properties influence the quality of red wines?

High level summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The summary of the wine data confirms we have 1,599 wine samples that the analysis will be based on, with 12 attributes for each wine. Our target variable of interest is quality as we want to determine if there are other features in the data set that are particularly correlated to a higher or lower quality of wine.

However it looks like there is one variable in there (“X”) that is not actually descriptive of my data set and is rather just an index in the data that increments for each row. Let’s remove that.

Now let’s reprint those names and see if that looks better.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

It looks as though some scaling will be required as the range of values for free.sulfur.dioxide and total.sulfur.dioxide seem to be on a drastically different scale than the remainder.

Let’s look at the distribution of wine quality over the data set.

Quality data distribution

The histogram demonstrates that the wine samples roughly follow a normal distribution, with the majority of wines scoring a 5 or a 6 on the quality scale. This is confirmed from the summary above showing the mean of 5.636 and median of 6. Interestingly no wines scored a 9 or 10, and nothing scored below a 3. I wonder if this is due to some inherent bias from the individual assigning quality scores, of not wanting to grossly over or under mark any wines that they tested.

The bias in this data toward the middle of the quality curve is likely to mean that we can find predictors of mid quality wine, but not necessarily high quality (as the data will be too sparse to effectively identify high quality wines).

Looking into the attributes available

So, on to looking for the magic ingredients that make wine “good” quality. Let’s start by plotting every feature independently on a histogram.

Fixed acidity:

Volatile acidity:

## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Citric acid:

## Warning: Removed 1 rows containing non-finite values (stat_bin).

Residual sugar:

## Warning: Removed 29 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Then chlorides

## Warning: Removed 43 rows containing non-finite values (stat_bin).

Free sulfur dioxide:

## Warning: Removed 16 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Total sulfur dioxide:

## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Density:

pH level:

## Warning: Removed 8 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Sulphates:

## Warning: Removed 17 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Alcohol:

## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Attributes vs. Quality

Now that we have a better understanding of the distribution of each individual element, let’s compare features to quality to see if trends start to emerge on key features of high quality wine.

Maybe more alcoholic wines are considered of higher quality? I will look at these two features plotted against each other to see if there is any association between the two.

This plot indicates there is a weak relationship between alcohol content and quality of wine, as alcohol content increases there is a moderate relationship to higher quality wines (with very low quality wines all being relatively low alcohol content). However this does not visually appear to be highly correlated as there is a wide distribution of strength and quality once the quality score reaches 5 or greater.

Based on the descriptive information that helps readers understand what these different features relate to, there are some other potential characteristics of interest.

Let’s explore the relationship between quality and volatile acidity. Based on the descriptive information we should expect to see a ceiling on quality once volatile acidity reaches over a given amount.

There is definitely a trend here, with very few wines having more than 0.8 gm of acetic acid in the highest quality buckets - but interestingly not many wines in general seem to be over this level (even the lowest quality wines).

Next I want to look at residual sugar levels, to see if sweet red wine is considered higher quality than dry red wine.

## Warning: Removed 86 rows containing missing values (geom_point).

It looks as though the higher the quality wine will contain less residual sugar - indicating that sweet red wines are not considered to be of high quality. What about chlorides / salt? It would seem that salty wine is not desirable, so I think that we’ll see low salt in wine across the board regardless of quality level.

## Warning: Removed 69 rows containing missing values (geom_point).

Using boxplots to explore the distribution of data, vs. a jitter plot

As expected, the highest quality wines all have very low levels of chlorides. The interesting part is that mid quality wines seem to have the highest salt levels compared to even the lower quality wines (see the upper bound on quality 5). I want to look at this as a boxplot to see if that shows a better story of the distribution of this particular feature.

## Warning: Removed 69 rows containing non-finite values (stat_boxplot).

The box plot clearly shows the mid tier wines having significant outliers in terms of salt content (mostly in a negative direction). This is very interesting, and seems like it could be a candidate for an ingredient to move a mid tier wine to a top tier wine (controlled salt content).

Continuing the investigation of individual attributes related to quality, next I want to look at the different types of sulfur and their relationship to quality.

First free sulfur dioxide:

## Warning: Removed 45 rows containing missing values (geom_point).

And total sulfur dioxide:

## Warning: Removed 9 rows containing missing values (geom_point).

Again, similar trends as would be expected based on the descriptive information provided about these attributes. Higher quality wines generally seem to have a lower quantity of sulfur dioxide (either free or total), however there are wines at lower quality levels that have the same amount of sulfur - so that is indicative that this is not a silver bullet.

What about density of the water or pH level of the wine?

Water density:

pH level:

Neither density or pH seem to have a strong relationship with quality; both attributes are fairly distributed for high quality wines and there is no strong trend that can be drawn (positive or inverse) vs. wine quality based on these plots.

Finally I want to look at the level of sulphates in the wine:

## Warning: Removed 17 rows containing missing values (geom_point).

Again this plot seems to indicate there is a small relationship between quality and amount of sulphates (there should be a little, but not a lot, somewhere in the 0.75 range).

Introducing multiple variables into the plots

Based on all these individual features vs. quality plots - there is no standout “golden bullet” that is required to make a wine high quality. There are things that should be avoided (such as high salt content) but no one thing that will give a high quality wine - which is generally what I had expected before starting this analysis given the complexity of the subject area.

Therefore we need to look at multiple elements in combination to see if there are groups of attributes that do have a direct impact on the quality of wine.

## Warning: Removed 22 rows containing missing values (geom_point).

This plot is now starting to get interesting. Showing the relationship between the amount of chlorides (which we already knew was lower for high quality wines), and alcohol content. We can see an increasing relationship between the high alcohol content wines (shown by the lighter dots), combined with low chlorides.

I broke out the jitter plot to try and add more insight into what was happening in the over crowded 5/6 quality zone.

Let’s see what other relationships exist.

Using facets

Let’s use facets instead of a visual determinant to help visually separate the different quality categories

## Warning: Removed 21 rows containing missing values (geom_point).

The graphs above show that there is a tighter clustering of volatile acidity (confirming what we had looked at previously as an independent variable), however fixed acidity does not seem to have such a direct correlation to quality of wine.

## Warning: Removed 108 rows containing missing values (geom_point).

Similar story here, residual sugar and chlorides must be low for a high quality wine. There is not a huge difference in clustering in a quality 8 wine even when compared to the quality 4 wines though.

These features do not seem to be important toward quality detection. The pattern and the relationship of sulfur dioxide levels is pretty consistent between low and high quality wines (high quality wines are more clustered, but there are also less wines at this level).

Using a data model to find the correlated features

Ok. So all of the plots that I’ve built so far have given some insight into potentially correlated features to quality wine. However I’d like to explore whether we can use a correlation function to more scientifically identify the features that are most correlated with a high quality wine.

I decided to try and train a random forest model to attempt to select the most important features in the wine data for high quality wine. Based on the output of this, I can continue plotting these features to have better insights into my data.

## Random Forest 
## 
## 1117 samples
##   11 predictor
##    6 classes: '3', '4', '5', '6', '7', '8' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 743, 745, 746 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa 
##    2    0.6696    0.4625
##    6    0.6589    0.4489
##   11    0.6535    0.4421
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Thhis is showing that the model is somewhat predictive (averaging at ~0.69), which is ok but not great. What about if we look at this on a per-quality basis?

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   1   0   0   1   0   0
##          5   1  11 154  39   3   0
##          6   1   5  50 137  30   5
##          7   0   0   1  15  27   0
##          8   0   0   0   0   0   1
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6618         
##                  95% CI : (0.6177, 0.704)
##     No Information Rate : 0.4253         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4517         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.000000   0.7512   0.7135  0.45000 0.166667
## Specificity          1.000000 0.995708   0.8051   0.6862  0.96209 1.000000
## Pos Pred Value            NaN 0.000000   0.7404   0.6009  0.62791 1.000000
## Neg Pred Value       0.993776 0.966667   0.8139   0.7835  0.92483 0.989605
## Prevalence           0.006224 0.033195   0.4253   0.3983  0.12448 0.012448
## Detection Rate       0.000000 0.000000   0.3195   0.2842  0.05602 0.002075
## Detection Prevalence 0.000000 0.004149   0.4315   0.4730  0.08921 0.002075
## Balanced Accuracy    0.500000 0.497854   0.7781   0.6999  0.70604 0.583333

The two metrics of interest are sensitivity and specificity - sensitivity being the rate of true positives as a ratio of all positivies, and specificity being the rate of true negatives as a rate of all negatives.

This output indicates that this model is best geared for predicting a quality 5 or 6 wine, and it is very good at predicting what is NOT the other classes of wines, but no so good at predicting what is (e.g. 1.000 specificity for quality = 8, but 0.167 sensitivity).

This is likely due to the large amounts of bias in the data toward those middle classes, meaning these is simply a lot more data available for the model to be trained off in class 5 and 6, and not as much to work with for the other classes. If the data was more evenly distributed then the output would likely be improved.

Feature importance

What features is the model selecting in predicting the quality of a wine? This will help ensure that I’m looking at the right elements in my final plots.

Looking at a graph of feature importance, it looks by far like the most important feature is alcohol content, followed by sulphates and volatile.acidity. However the features in the dataset are much more likely to be predictors for whether a wine is medium quality - rather than low or high quality (which is interesting).

Based on a fresh perspective from this model, let’s replot alcohol content and sulphates against each other.

The feature importance in the graph would seem to be highlighting the cluster of low alcohol wines that are in the “5” quality bucket; likely the thing that the random forest model picked up on (that lower alcohol wines are correlated to a 5 or 6 quality wine based on this data set).

Finally to make sure that we haven’t missed anything, a quick plot of the pearson correlation of all the attributes in our model to ensure that we haven’t missed any valuable plots before we move into the final analysis.

I think based on the analysis done so far in this project, I can now start to finalise my plots and draw some conclusions.

Final Plots and Summary

Let’s start by re-iterating the purpose of this data exploration:

Which chemical properties influence the quality of red wines?

The best way to start answering a question such as the one above, is to look at the relative correlations of all the feautres in the dataset available, plotted against each other. However more than what I did in the plot above, we should ensure that the chart is visually representative of the hotspots of correlation and whether the correlations are significant (statistically) or not.

The chart above highlights that there are a handful of features positively correlated with wine quality, namely:

  1. Alcohol Content
  2. Citric Acid content
  3. Sulphates
  4. Fixed Acidity of the wine

In addition, there are some features that are negatively correlated with the quality of the wine:

  1. Volatile Acidity
  2. Density
  3. Total Sulfur Dioxide
  4. Chlorides

From my earlier analysis, there was an interesting relationship present between the Alcohol content, the Sulphates in the wine, and the quality score attained. Let’s visualise that relationship:

## Warning: Removed 14 rows containing missing values (geom_point).

The plot above shows a relationship between all the provided attributes - we can see as wine quality is increasing that there is an increasing amount of citric acid (however not reaching the top of the citric acid range, settling between 0.50 and 0.75 g/dm^3), a similar amount of potassium sulphate and at least 12% alcohol by volume.

Similarly, we can plot the features that were negatively correlated with quality to see how that differs on a plot:

## Warning: Removed 9 rows containing missing values (geom_point).

The plot above demonstrates the less correlated features, with higher sulfur dioxide content, higher volatile acidity or higher denisty values being associated with lower quality wines. Note that as with the results noted previously from the Random Forest model employed during my initial analysis, it is possible that the skew in the sample data is affecting these results.

That is, it is possible that due to the lower quantity of “high quality” wines, that we are missing out on correlating these features with high quality wines purely due to the sparsity of data.

So in conclusion, if a wine maker would like to ensure their wines are of high quality - there is unfortunately no silver bullet. The US wine market is responsible for approximately $220bn worth of value each year toward the american economy - so it stands to reason this is a complicated problem.

The analysis above highlights if you want to improve quality there are 6-8 features of your wine that you can double down on which will maximise your chances of producing a good wine, so get mixing!

Reflection

The red wine data set contained 1599 data samples covering 13 variables from around 2009. I started out looking at what variables were available in the dataset, and looking at the distribution of my data. Then based on the descriptive text provides about the attributes I started exploring the data to see if there were any immediately obvious relationships between the features and quality of the wine. After bottoming out individual features, I then started exploring pairs of features and did start to notice more trends emerging.

I then took a turn in my analysis and decided to use a more data driven approach, opting to build a simple ML model (using a random forest predictor) to try and classify the data based on the available features and highlight which features were the most predictive of quality. The model had varying levels of success, mainly due to the highly sparse data, and was more predictive for 5/6 quality wines rather than all wines.

Using a data driven approach seemed to work well, and helped rather than manually exploring all the potential pairs of features in the data set. However to improve the data set would need to be larger (ideally updated from 2009 to current date), so we could have more data and potentially even examine the trend in quality vs. chemical composition over time as potentially tastes have changed in the last 8-9 years.

More data would help here as the sparsity of the data set and bias toward the middle quality tiers causes the data to be less predictive (e.g. the model will say most things are a 5 or 6, because most things were a 5 or 6 in the provided data).